Evaluation of pulmonary single‐cell identity specificity in scRNA‐seq analysis

Dear Editor,
Single-cell RNA sequencing (scRNA-seq) provides new insights into the single-cell transcriptomic atlas and intercellular communication. 1 scRNA-seq is used to characterize the considerable heterogeneity and complexity of cell types, to uncover cell fate and context, critical molecular features and progression trajectories, and to explore potential pathogenesis and individualized therapeutic targets. [2][3][4] Although bioinformatic and statistical methods for scRNA-seq are being developed and improved rapidly, one of the major challenges is to identify biology-specific biomarkers that serve as the cell identity for cell phenotypes and functional subtypes. 5 More and more cell types and subtypes are being identified in response to various stimuli, external challenges and even pathological conditions. The correctness and specificity of cell-specific transcriptomic profiles based on scRNA-seq depend heavily on the accuracy of cell type identity and cellular annotation. The mapping of a single-cell atlas is based on selected marker genes. Intricate cell types and subtypes are defined by the provided cell annotations and further validated using cell identity marker gene panels (ciMGPs), 6 so the specificity of ciMGPs is critical for constructing the single-cell profile.
The current study aims to evaluate the specificity of ciMGPs for various pulmonary single-cell identities and to define disease-specific alterations of single-cell populations labelled with ciMGPs. We screened and selected 57 ciMGPs from previous studies [6][7][8] and validated the identities of ciMGP-based cell types/subtypes in lung tissues from healthy subjects or patients with idiopathic pulmonary fibrosis (IPF), chronic obstructive pulmonary disease (COPD), systemic sclerosis (SSC), lung adenocarcinoma (LUAD), large cell cancer (LCC) or para-cancer tissues as paired controls, as detailed in Supplemental Materials. Of those cells, immune cells resident in the lung tissue included nine subtypes of lymphoid cells and 15 subtypes of myeloid cells, 8 and lung parenchymal cells comprised 15 subtypes of epithelia, nine of endothelia and nine of stromal cells, as presented in Figures S1-S48 and detailed in Tables S6-S11. We comprehensively assessed and quantified the specificity of each ciMGP, representing 57 specific cell subtypes, among lung diseases. We first developed the criteria and schematic diagram to determine the specificity and accuracy of lung cell ciMGPs, as explained in Supplemental Method. The scRNA-seq data for evaluation were collected from various databases (Tables S1-S3). We used the overlap expression rate (OER) of a ciMGP in a cell subtype, compared with its expression in other cell types/subtypes, to define three categories: an OER of less than 5% as a 'cell-specific marker panel' with high specificity, between 5% and 10% as a 'cell-associated marker panel' with moderate specificity, and more than 10% as a 'cell-reference marker panel' with low specificity, as explained in Supplemental Method and Figure 5. We paid special attention to alterations of ciMGP specificity in disease states, which are relevant to clinical translation and wider adoption.
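The three OER categories above can be expressed as a simple threshold rule. The following is a minimal sketch, not the authors' implementation: the function and enum names are hypothetical, the handling of values exactly at 5% and 10% (both assigned to the cell-associated category) is an assumption, and the calculation of the OER itself is described in the Supplemental Method and is not reproduced here.

```python
from enum import Enum


class PanelCategory(Enum):
    """Hypothetical labels for the three ciMGP categories named in the text."""
    CELL_SPECIFIC = "cell-specific"      # OER < 5%, high specificity
    CELL_ASSOCIATED = "cell-associated"  # 5% <= OER <= 10%, moderate specificity
    CELL_REFERENCE = "cell-reference"    # OER > 10%, low specificity


def classify_oer(oer_percent: float) -> PanelCategory:
    """Map an overlap expression rate (in percent) to a panel category.

    Assumption: the boundary values 5% and 10% both fall in the
    cell-associated category; the text only says "between 5% and 10%".
    """
    if not 0.0 <= oer_percent <= 100.0:
        raise ValueError(f"OER must be a percentage in [0, 100], got {oer_percent}")
    if oer_percent < 5.0:
        return PanelCategory.CELL_SPECIFIC
    if oer_percent <= 10.0:
        return PanelCategory.CELL_ASSOCIATED
    return PanelCategory.CELL_REFERENCE


if __name__ == "__main__":
    # Illustrative OER values only; these are NOT results from the study.
    example_oers = {"panel_a": 2.1, "panel_b": 7.4, "panel_c": 18.0}
    for panel, oer in example_oers.items():
        print(f"{panel}: OER {oer}% -> {classify_oer(oer).value}")
```

Applying such a rule per disease state, rather than once per panel, would make it easy to see when a panel moves between categories across conditions.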
We established the transcriptomic profiles of pulmonary single cells based on uniform manifold approximation and projection (UMAP), reflecting the abundance and distribution of cell types and subtypes (Figure 1). The specificity of ciMGPs was evaluated in different cell types, subtypes, locations and diseases. The OER values of the B cell (Figure 2A) and adventitial fibroblast (Figure 2B) ciMGPs were less than 5% in normal lung tissue and in the various lung diseases.
The AT1 ciMGPs were highly specific in most samples except for LCC, as compared with the remaining 56 cell subtypes (Figure 2C). Of the 15 subtypes of lung epithelial cells, the ciMGP specificity of ciliated cells was the highest in eight lung tissues (Table 1). Basal epithelia could not be detected in LUAD samples with the provided panel, and the efficacy of the panel for goblet epithelia was weakened in LCC samples, according to the criteria proposed [...] comparatively decreased in the states of LCC and SSC (Table 1, Figures S4 and S35). However, ciMGP mRNA expression of bronchial vessel 2 and pulmonary ionocyte was hardly detected in any cell type or lung disease in our study (Table 1, Figures S8 and S24). Representative cell-specific panels for cell annotation are summarized in Tables S4 and S5. The specificity and importance of ciMGPs can change dynamically with disease nature and severity. The specificity of ciMGPs in myeloid cells (Figure 3C) was higher than in lymphoid cells (Figures 2A and 4B). The ciMGPs of signaling AT2, CD8 naïve T cell and classic monocyte were classified as 'cell-associated panels' in normal lung tissue, but became 'cell-specific panels' in lung diseases and varied among lung diseases (Figure 3). The signaling AT2 panel was clearly up-regulated in AT2 cells from normal and para-cancer lung tissues, but became unclear in LUAD and IPF (Figure 3A). Compared with tissue-resident cells, the ciMGPs of immune cells, including CD4 memory effector T, CD4 naïve T, CD8 memory effector T, NK or NKT, showed relatively lower specificity, were more difficult to annotate and were highly expressed in 2-3 other cell subtypes (Figures S12, S13 and S32, Table 1). This might be attributed to the relatively conserved function of structural cells in the lung, whereas immune cells exist in intermediate and functional states with continuous dynamic remodeling. Immune cells can be activated by external stimuli to act as the primary force of host defense in the lung, including rapid recruitment and migration. 9 Our data demonstrated that the ciMGP specificity of capillary intermediate endothelia 2, natural killer and lipofibroblast was low in the majority of lung samples (Figure 4A-C), placing them among the cell-reference ciMGPs with high OERs.

[Figure caption, partially recovered: Cell-associated (5%-10%) [...] LGALS2, CD14, NRG1, S100A8, S100A9, S100A12) harvested from normal lung samples and six lung diseases (samples of para-cancer, LCC, LUAD, IPF, COPD, SSC). The panels of signaling_AT2, CD8 naïve T cell and classic monocyte were presented as examples of cell-associated panels with a 5%-10% overlap expression rate. The detailed calculation procedure of the overlap expression rate can be seen in the supplemental methods.]
The specificity of ciMGPs is valuable for a deeper understanding of the heterogeneity among various lung diseases and pathological states. Several ciMGPs with tissue-specific expression patterns have potential clinical implications.
The ciMGPs of tissue-resident cells and immune cells showed obvious differences in disease specificity, especially for the serous and CD4 naïve T cell subtypes, whose ciMGP specificity was significantly higher in LUAD and LCC tissues (Table 1). The ciMGPs of AT1, Club or mesothelial cells showed comparatively low specificity in LCC samples (Table 1, Figures S17 and S27), while the specificity of the ciMGPs of CD4 naïve T and non-classic monocytes was higher in LCC (Table 1). Some ciMGPs were expressed in multiple cell types, subtypes and diseases, for example, those of capillary intermediate endothelia 2 (Figure 4A), natural killer cells (Figure 4B) and lipofibroblasts (Figure 4C). It is critical to evaluate the specificity of ciMGPs in normal tissues to precisely define the representatives of cells and set the reference baseline, in diseased tissues to check for abnormal values, and in development-related cell subtypes to clarify the OERs. Cells of similar developmental lineage often share common canonical molecular markers and resemble each other in their expression patterns, which makes it difficult to distinguish between AT1 and AT2, basophil/mast 1 and 2, vascular and airway smooth muscle cells, proximal basal and basal cells, or myeloid dendritic type I and II. We found that ciMGPs varied across multiple cell subtypes and even types; for example, the ciMGP signature of dendritic cells was also highly expressed in alveolar epithelial cells and basophil/mast cells (Figures S2, S5 and S6). This implies that they might share a common underlying gene expression pattern and close molecular interaction or crosstalk among diverse types of cells.
The novelty of the present study is that it comprehensively defines and evaluates the ciMGP specificity of pulmonary single cells, proposes three categories based on OER values, determines the differences in ciMGP specificity among multiple pathological conditions and provides new alternatives for quality control in scRNA-seq data analysis and clinical application, as proposed previously. 10 However, the limited scRNA-seq data sets might affect the extrapolation of the conclusions. The specificity across distinct stages, surgical procedures and lesion sites of lung diseases, and even across data derived from different sequencing methods, needs to be further validated.
In conclusion, for the first time, we developed criteria to evaluate the ciMGP specificity of lung cell types/subtypes from various lung diseases and characterized three categories of cell-specific, cell-associated and cell-reference ciMGPs on the basis of scRNA-seq. ciMGP specificity varies among cell types and subtypes, disease natures and stages, and responses to therapies, and its evaluation should form part of quality control in scRNA-seq analysis, although the evaluation and criteria of ciMGPs need to be further improved and optimized. Thus, we believe that the precise evaluation of ciMGP specificity is of considerable importance for bioinformatic analysis, single-cell categorization, data interpretation and accurate conclusions.

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.